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ABSTRACT 


The  dissected  direct  solution  of  f inito-clcment  grids  has  bee*, 
shown  to  be  more  efficient  than  band-related  methods.  In  this 
paper,  it  is  shown  that  vector  processors  can  be  used  to 
solve  such  grids  despite  their  apparent  disassociated  structure. 
Operation  counts  and  vector  counts  are  given  as  a function  of 
grid  size  ar.d  order  of  rectangular  element. 


INTRODUCTION 

The  architecture  of  fourth  generation  scientific  processors  (vec- 
tor processors)  places  a high  premium  on  regularity  of  problem 
data  structure.  In  the  class  of  sparse  matrix  problems,  such 
structure  arises  most  obviously  from  the  direct  solution  of 
finite  element  (or  finite  difference)  grids  resulting  from  the 
solution  of  partial  differential  equations  (Ref.i). 

Traditionally,  such  grids  are  solved  either  iteratively  or  with 
band-related  methods.  Although  suitable  for  vector  processing, 
band  methods  are  now  known  to  be  computationally  inefficient  for 
all  but  the  smallest  problems.  With  benchmarked  results  (Ref. 2) 
showing  vector  processing  rates  14  to  GO  times  those  of  the 
fastest  scalar  processors,  problem  size  is  growing  to  the  point 
that  failure  to  use  these  more  elaborate  but  efficient  dissec- 
tion methods  (Ref. 3)  will  result  in  an  order  of  magnitude  loss 
in  computing  time,  clearly  an  unacceptable  pries.  On  the  other 
hand,  dissection  metnods  necessitate  processing  relatively  small 
matrices  in  the  early  stages  of  solution,  and  so  are  unfavorable 
to  vector  processing  (Ref, 3).  As  the  dissected  solution  process 
progresses,  however,  increasingly  larger  (full)  matrices  are 
encountered.  It  has  been  an  open  question  whether  the  small 
matrices  of  the  initial  phases  can  be  compensated  by  the  later 
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large  matrices,  so  that,  overall,  the  problem  appears  sufficient- 
ly structured  to  justify  vector  processing. 

Another  complication  introduced  by  size  is  the  need  for  a backing 
store  to  keep  intermediate  results.  In  reducing  the  computa- 
tional complexity  by  dissection,  one  risks  being  overwhelmed  by 
reading  and/or  writing  to  memory.  To  meet  space  limitations, 
consideration  of  this  problem  has  been  deleted  from  this  paper 
but  appears  in  Ref.  3. 

In  this  paper,  expressions  are  developed  for  operations  and  vec- 
tor counts  of  dissected  finite  element  grids  as  functions  of  the 
grid  size,  variables/node,  and  nodes/element.  The  vectoriza- 
bility  of  certain  common  finite  element  problems  is  studied  using 
simple  models  of  current  vector  processors. 

The  discussion  will  assume  that  the  reader  is  familiar  with  dis- 
section strategies  as  described  by  George  (Ref.l),  as  well  as 
with  concepts  and  terminology  associated  with  solution  of  prob- 
lems on  vector  processors  (Ref. 2). 

A DISSECTION  ALGORITHM  FOR  ARBITRARY 
RECTANGULAR  ELEMENTS 

Consider  the  sequence  of  rectangular  finite  elements  illustrated 
in  Fig.  1(a),  with  m+2  nodes/side  and  Z variables/node  .*  It  is 
assumed  that  all  variables  associated  with  an  element  are  coupled 
to  one  another,  represented  by  a full  matrix  if  the  variables 
were  ordered  consecutively  in  the  grid. 

Now  consider  a square  grid  of  2n-2  elements/side  as  illustrated 
in  Fig.  1(b).  The  grid  is  bisected  in  both  dimensions  until 
only  trivial  bisections  remain.  The  nodes  are  eliminated  in 
reverse  order,  taking  care  that  nodes  along  the  periphery  are 
eliminated  as  soon  as  possible.  For  example,  in  Fig.  2,  a local 
corner  segment  of  a grid  of  quadratic  elements  is  illustrated; 
all  nodes  marked  "1"  are  eliminated  consecutively,  followed  by 
nodes  marked  "2",  etc.  If  each  node  represents  Z variables, 
these  variables  are  eliminated  consecutively  in  any  order. 

OPERATION  AND  VECTOR  COUNTS 
Introduction 

The  characterization  of  vector  operations  arising  from  a dis- 
sected finite  element  solution  involves: 

(a)  identification  of  vector  operations  from  a matrix 
structure; 


♦The  terminology  "linear,"  "quadratic,"  and  "cubic"  will  be  used 
to  describe  elements  with  2,  3,  and  4 noces/side;  it  should  be 
noted,  however,  that  certain  classes  of  higher-order  elements 
result  in  more  variables/node  rather  than  mere  nodes/side 
(e.g.,  Hermite  bicubic). 


m+2  Nodes /Side 


(a)  Element  models 


2n  -2  Elements/Side 


(b)  Dissected  grid  example  (m=l,n*=3) 


Figure  1.  Finite  element  and  dissected  grid  example  * 
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Figure  2.  Quadrant  of  finite  element  grid.  Numbers 

joined  by  dashes  constitute  a single  vector, 


(b)  relation  of  matrix  structure  to  grid  properties; 
combined  with  (a) , this  allows  vector  operations  to 
be  directly  related  to  the  grid  structure; 

(c)  developing  general  expressions  for  vector  counts, 
with  undetermined  coefficients; 

(d)  solving  for  the  coefficients  by  counting  low-order 
cases. 

The  arithmetic  operation  counts  are  developed  similarly  to  vec- 
tor counts,  using  counting  formulae  developed  in  Ref.  1.  It 
should  be  noted,  however,  that  the  procedure  proposed  here  does 
not  involve  symbolic  reduction  to  arrive  at  counts  as  in  Ref.  i, 
but  instead  solves  for  the  coefficients  numerically. 
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Simple  Vector  Counts 
Consider  the  equations 

Ax  = b (1) 

where  A is  a constant  a x a matrix,  structurally  symmetric 
but  numerically  unsvmmetric, 
b is  a constant  ct  x 1 matrix, 
x is  an  unknown  a x 1 matrix. 

Then  define  the  LU  factorization  LU  = A where  L and  U are  lower 
and  upper  triangular  matrices,  respectively,  with  unit  diagonal 
entries  for  L.  Then  x is  found  by  solving  Ly  = b and  Ux  = y. 

One  manner  of  viewing  this  factorization  is  based  on  the  outer 
product  calculation  of  Fig.  3a.  For  counting  purposes,  a full 
matrix  data  structure  is  assumed;  however,  simple  vectors  are 
recognized  and  ordered  down  coluns  of  L.  The  r-th  simple  vector 

a(I(i(r),j(r)),N(r))  = [a^  a2  r ..  aN(r)>rlT 

begins  at  position  (i(r),j(r))  in  the  full  matrix  structure  and 
has  length  N(r).  The  index  function  I ( i , j ) gives  the  starting 
location  of  a vector  in  the  column-ordered  packed  matrix  obtained 
by  eliminating  zero-valued  positions.  In  Fig.  3a,  the  k-th  pivot 
step  is  illustrated,  and  involves  operations  on  t simple  vectors 
in  the  k-th  column  of  A. 

Consider  now  in  de*  1 the  process  of  pivoting  at  the  (k,k) 
diagonal  position.  ie  r-th  vector  is  the  first  vector  identi- 
fied below  the  (k,  iagonal  entry.  We  define 

t 

s = £ N (r  + w - 1) 

w=l 

where  s is  the  number  of  non-zero  positions  in  the  k-th  column. 
The  k-th  pivot  process  begins  by  a reciprocation  of  the  k-th 
pivot  element  and  a multiply  of  the  k-th  column  by  the  recipro- 
cated pivot  (1  vector  multiply,  s arithmetic  operations  not  in- 
cluding the  divide) . The  entire  k-th  column  is  then  multiplied 
by  the  q-th  element  of  the  k-th  row  (1  vector  multiply  and  s 
arithmetic  operations  for  each  row  element) . The  resultant  vec- 
tor is  subtracted  from  the  q-th  column?  however,  the  subtraction 
in  general  requires  t vector  operations  corresponding  to  the  t 
vectors  in  the  k-th  column  which  must  be  scattered  in  the  sub- 
traction from  the  differently-structured  q-th  column.  This 
accounts  for  t vector  operations  and  s arithmetic  operations  for 
each  of  the  s components  of  the  k-th  row.  The  total  number  oper- 
ations is  easily  summed  at  the  k-th  pivot  as 


VEC(k)  = s(t  + 1)  + 1 
OPS (k)  = s (2s  + 1) 


(2) 

(3) 


High  Level  Vectors 

The  above  gives  a simple  vector  count  (Ref. 2)  since  all  vector 
operations  can  be  performed  in  a single  loop.  A processor  cap- 
able of  handling  three  levels  of  loop  nesting  allows  the  multi- 
plication of  any  two  conformable  matrices  within  one  vector  oper- 
ation. If  such  block  structure  can  be  recognized  within  a prob- 
lem, then  fewer  vector  operations  will  be  required  vis-a-vis  a 
simple  vector  implementation. 

Define  a completely  coupled  set  of  nodes  as  a set  where  (1)  all 
nodes  are  coupled  to  one  another,  and  (2)  a node  not  in  the  set 
is  coupled  to  one  member  of  the  set  if  and  only  if  it  is  coupled 
to  all  members  of  the  set.  This  property  gives  rise  to  a block 
matrix  structure  (Fig.  3b)  similar  to  the  single  row/ column 
structure  of  Fig.  3a,  so  that  vectors  can  be  similarly  counted. 

A summary  of  the  vector  operations  for  an  m * m block  coupled  to 
t other  blocks  is 

Pivot  broadcast  multiplies:  m 

Outer  product  multiplies:  m 

Subtractions:  (m  - l)fc2  + t2 

where  t = t if  blocks  A^i  and  A^2  are  adjacent  in  the  matrix 

t = t + 1 if  blocks  A^  and  A^2  are  nonadjacent. 

Thus,  at  the  k-th  block  elimination,  the  vector  count  is 

VEC(k)  = 2m  + (m  - l)t2  + t.  (4) 

Note  that  this  count  does  not  involve  the  size  of  the  blocks  to 
which  the  k-th  block  is  connected. 


Grid  Recognition  of  Vectors 

The  next  step  in  the  determination  of  an  expression  for  vector 
count  is  the  ability  to  recognize  vectors  in  the  finite  element 
grid.  The  general  rule  for  such  recognition  is  the  following: 

If  at  some  stage  in  the  elimination  process,  the 
node  being  eliminated  (node  1 in  Fig.  4)  is 
coupled  to  a set  of  consecutively  numbered 
nodes,  nk,  *^+1' • • • »n.+r  (e.g. , 30-36  in  Fig.  4), 

then  consecutive  identical  operations  involving 
the  node  set  can  be  performed  in  a simple  vector 
mode . 

For  arbitrary  node  ordering,  such  vectors  are  not  readily  identi- 
fied visually;  however,  the  dissection  process  introduces  an  or- 
dering along  rows  and  columns  of  the  grid  as  illustrated  in  Fig. 
4,  so  that  consecutively  numbered  nodes  are  usually  adjacent  in 
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(a)  Vector  counting  procedure  at  kth  pivot  step 
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(b)  Fill  pattern  of  blocked  sparse  matrix 
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Figure  4.  Relation  of  grid  ordering  to  matrix 
vectors. 


the  grid  structure.  Thus,  vectors  can  be  identified  by  a line 
joining  adjacent  nodes  of  a set,  as  shown  in  Figure  4.  The  num- 
ber of  vectors  associated  with  the  elimination  of  selected  inter- 
ior nodes  of  the  example  are  given  in  Table  1.  It  is  especially 
noteworthy  that  the  elimination  of  the  last  node  of  a node  set 
(e.g.,  node  2 in  Fig.  4)  involves  one  fewer  vectors  than'elimin- 
ation  of  the  previous  vector  nodes. 

Higher  lead  vectors  can  be  similarly  identified  by  coupling 
between  groups  of  nodes. 


Table  1.  Vector  Count  for  Sample  Node  Elimination  in  Fig.  4 


Node 

Eliminated 

Node  Sets 

.No.  of 
Vectors 

1 

2,  5-9,  30-36,  51-53,  82-84 

5 

2 

5-9,  30-36,  51-53,  82-84 

4 

3 

4-9,  53-63,  80-82 

3 

5 

6-9,  30-36,  51-63,  80-84 

4 

9 

30-36,  51-63,  80-84 

3 

9 


Complexity  of  Vector  and  Operation  Counts 

In  Ref.  4,  it  is  shown  that  vector  counts  for  dissected  grids  a>-e 
of  the  general  form 

VEC  = zq  + z^n + (z2  + z3n) 2n  + (z4  + z5n) 22n  + (zg  + z?n) 23n  (5) 


where 


zi  = wi;L£  + wi2£2+  wi3&2  + (wi4£  + wi5Jl2  + wig^3)m 

2 3 2 *>  3 3 

+ (wi7£+wi8J,  +wi9«-  )m  + ^HQl  + will2'*"  + wil2Jl  )m 


Similar  formulae  apply  to  operation  counts  (Ref.  4) , with  certain 
coefficients  being  zero-valued  in  each  case. 

Coefficients  are  found  by  counting  the  vectors  and  operations 
for  a number  of  small  values  of  n,  l,  and  m using  (2)  and  (3) 
and  solving  for  the  wj.  j . For  example,  84  grids  were  simulated 
to  obtain  the  operation  count,  and  36  for  the  simple  vector 
count.  The  results  are  summarized  in  Table  2. 


VECTORIZ ABILITY  OF  THE  SOLUTION 
Average  Vector  Length 

In  Ref.  2 it  is  shown  that  the  fraction  of  the  computation  time 
devoted  to  arithmetic  operations  in  a pipelined  processor  is 

„ arithmetic  time 


startup  time  + arithmetic  time 


where  T is  the  arithmetic  operation  time,  Tg  the  vector  startup 

time,  and  £.  the  length  of  the  i-th  vector.  Although  T /T  is 
i n op'  s 

a machine  parameter,  the  quantity  .E.  l./n  = L is  problem- 

dependent,  and  identifiable  as  the  "average  vector  length."  It 
may  be  computed  from  Lave  = OPS/VEC,  the  latter  given  in  (6-10). 

Its  most  important  property  is  that  when  L = T /T  , 1/2  of 

ave  s op 

the  total  computation  time  is  devoted  to  vector  startup. 


Empirical  Study 

The  average  vector  lengths  resulting  from  a number  of  finite  ele- 
ment grids  have  been  determined  from  Eq.  (6-10).  These  range 
from  single  variable  cubic  problems  perhaps  associated  with 


DPS  = [(-|! I + 18|£  + 126^1'')  + (-J4  + 22*  + 9 7 6 jj-*  ) m 

+ (-1U2  + 1543j|£3)m2  + 695y43m3]  + 2n[(~£  - 4142  - 223^-43 

+ (4  - 8442  - 41043)m  + (-3542  - 5043)m2  + 13443m3] 

+ 22n[(-|i  + 23^42  + 4643 ) + (-j4  + 3642  - 228|43)m 

+ (1242  - 573^3)m2  - 29843rn3] 

J (6) 

+ 23n[39y43  + 118y£3m  + 118|43m2  + 39y43m3]  m > 0 

+ n[(442  + 1643)  + ( 842  + 4843)m  + 3243m2] 

+ n2n[  (-242+10043)  + (2042+43643)m+(2242+48443)m2+148il3m3] 

+ n22n{(-ljl2  - 9943)  + (-15^-4  2 - 19843 )m  + ( - 2 - 9943)m2] 
Simple  Vectors 


VEC  = 6-  4-  100^4 2 + (6^-4  + 25242)2n  + (-7^4  - 159^2)22n  (7) 

+ (-104  - 3442 ) n + (114  + 4442)n2n  + 44^42n22n  m = 0 

VEC  = [(14j4  - 100|-42)  + (-11^4-  162|42)m  + (5j42)m2] 

.6  6 Z 6 b 

+ 2n [ ( 10^4  + 25242)  + (29^4  + 566i42)m  + (27li42)m2] 

+ 2 2n [ ( - 8 j^-4  - 1 5 9 ^4 2 ) + (-I0J4  - 255j42)m  + ( - 9 2 ) m2 3 

+ n [ (-10  4-3442)  + (-204  - 8042)ra  + (-2042)m2] 

+ n2n [ (114  + 4 4 4 2 ) + (114  - 6642)m  + (-11042)m2]  (8) 

+ n22n [ 44^-4  2 + 89j42  m + 44|42m2]  m-  - 1 

High  Level  Vectors 

VEC  = -124-6842+  (124h-32^-42)  2n+  ( -3^4+2 5^-4 2 ) 22n+1242n-5442n2n  (9) 


VEC  = ( (84-684iJ)  + (1334^)m]  + 2n  [( 164+32-4  )- (125^r)  m] 
+22n[  (-64+25^42)  + ( 49^-4 2 )m]  + n[(1242)+ (5642)m] 

+ n2n[  (-5442)  -(5442)m]  m>  1 


stress  analysis  problems  to  ten-variable  linear  atmospheric/ 
ocean  model  elements.  The  results  are  plotted  in  Fig.  5,  to- 
gether with  an  indication  of  the  solution  time  (not  including 
input/output)  on  the  Texas  Instruments  1-pipe  Advanced  Scien- 
tific Computer.  Two  cases  are  shown,  the  first  for  simple  vec- 
tors, the  latter  for  higher  level  vectors. 

For  the  simple  vector  case  of  Fig.  5a,  we  observe  that  Lave  - ■? 
for  all  matrices  requiring  1 minute  on  the  ASC;  thus  a 
medium  sized  transient  problem  could  be  solved  with  the  effi- 
ciency of  9^.5,  since  Ts/Top  s 60  for  the  ASC.  On  the  other  har 
a 1-hour  static  problem  would  result  in  Lave  > 140  and  an 
efficiency  of  .74.  On  the  Cray-1,  with  Ts/Top  =10,  an  effi- 
ciency of  .5  would  be  achieved  with  any  problem  requiring  more 
than  =1  second  per  solution. 

It  is  especially  interesting  to  compare  the  analytic  expression 
for  arithmetic  operations  and  average  vector  lengths  of  banded 
and  dissected  solutions.  The  asymptotic  counts  are  (n  00  , 
£-*“>,  m = 0) 

banded : 


dissected: 


The  OPS  are  less  for  dissection  when  n b 6.  The  LaVe  i-s  always 
larger  for  a banded  solution  than  a simple  vector  dissection, 
but  a high  level  vector  dissection  is  asymptotically  more  effi- 
cient than  a simple  vector  banded  solution. 

Figure  5b  further  demonstrates  the  considerable  advantage  to  be 
gained  in  using  higher-level  vectors.  The  average  vector  lenct 
plotted  on  the  same  scale  as  Fig.  5a,  is  well  above  60  for  all 
but  the  most  trivial  problem.  In  this  case,  vectorization  over- 
head can  clearly  be  ignored. 


OPS: 

L 

ave : 
OPS: 

Lave ' 
(Ref .3) 

Lave* 


£324n 


£2 


n 


(simple  vector) 


(394-)  £323n 
. 882£2n/ (n  - 3.55) 


(simple  vector) 


1 . 55£2n  (high  level  vector) 


CONCLUSIONS 

• 

The  vector  counts  given  in  this  paper  assumed  a data  structure 
that  favored  highly-coupled  systems;  e.g.,  L increased  with 
m and  £,  which  both  increase  the  coupling  within  an  element. 
Alternatively,  the  equation  decoupling  in  the  early  stages  of 
dissection  could  be  exploited  as  well  (Refs.  7,8),  but  a differ- 
ent data  structure  would  be  required.  Seemingly,  an  optimal 
strategy  would  involve  transitioning  from  one  structure  to 
another  as  the  connectivity  grows  during  solution.  This  is  an 
interesting  but  unsolved  problem. 
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(b)  High-level 
vector 


Figure  5.  Average  vector  lengths 
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Apart  from  the  concrete  complexity  results  presented  here,  the 
observation  that  problem  dissection  does  not  rule  out  vectori- 
zation  has  significance  to  other  computational  problems  as  well. 
For  as  shown  by  Rose  (Ref. 4),  dissection  is  a form  of  "divide 
and  conquer"  algorithm  (Ref. 5),  an  efficient  general  solution 
strategy  which  on  first  inspection  does  not  appear  to  favor 
vectorization.  The  fact  that  the  fill  during  solution  compen- 
sates for  the  initial  disassociated  structure  is  perhaps  the 
most  interesting  of  these  results. 
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